1 Pre-session reading

Time - 30-60 minutes

1.1 Recap

To briefly recap, last week we:

  • Refreshed basic statistics (normal distribution, comparing means, etc.)
  • Discussed what a GLM is, and how it relates to simpler statistics like linear regression
  • Discussed error distributions & link functions
  • Covered techniques for when we don’t have normally distributed data
  • Learned how to fit and visualise a GLM

Today we are going to move onto the big brother/sister of GLMs, GLMMs…

1.2 Mixed effects models - GLMMs

Like GLMs, generalized linear mixed-effects models (GLMMs) allow us to model data which are non-normally distributed. However, they also allow us to deal with data which have underlying structures that might influence the patterns we see, and thus the inferences we make from our statistical analyses.

Remember, in fixed effects models (e.g. GLMs) we assume that the only source of variance is the random sampling we do of the response variable - for instance, measuring the sepal width of the Irises we considered last week. By random sampling, we mean that you have not gone out and only sampled individuals which are, say, over 50 cm in height.

However, many data sets have variance which is potentially explained by systematic differences. These effects could be differences between sites where we have collected data, data collected from different populations of the same species, etc., and mean that there might be additional structure in the data which is not explained by our fixed effects.

To take an example from the Iris data from last week, Irises may grow better at some field sites than others, and these differences might affect the relationship between sepal width and sepal length. To accurately estimate how sepal length affects sepal width we want to account for those underlying differences and estimate the effect independently of the site effect. We can do this by including random effects in our model.

Random effects are complicated to get your head around, but there is a fantastic visual explanation of the effects of data grouping available here which I recommend going over - it makes GLMMs very intuitive.

GLMMs are a pain to fit well, and complex to investigate/assess - BUT they are incredibly useful and powerful. A very good place for advice is Ben Bolker’s excellent post on frequently asked questions for GLMMs available here. There is another complementary post where he works through a series of GLMM examples available here.

1.3 Types of random effect

1.3.1 Random slopes vs Random intercepts

There are two kinds of random effect - random slopes, and random intercepts. In most cases we specify a random intercept - i.e. we assume that there are differences between the groups in where the regression line fitted through the data crosses the y axis, but no differences between the groups in the rate of change of that regression line (the slope):

You can also specify the model to have random slopes, i.e. the model will have a fixed y intercept, but the slope will be allowed to vary based on some data structure (e.g. the site the data was collected at).

We can even (although this is rare) fit models where both the slope and the intercept vary:

As I said earlier, most often we fit models with a random intercept, as we assume that there is going to be the same relationship between the x and y variables, but that the magnitude of the pattern might change systematically between groups. Again, for an excellent visual walkthrough of these concepts see here.

1.3.2 When is it a random effect? When is it a fixed effect?

This sounds like a trivial question, but it isn’t.

Typically:

Fixed effects are things you have measured (e.g. the temperature at a site, the soil pH)

Random effects account for things which might change but which you haven’t (or can’t) explicitly measure (e.g. you might assume that there are environmental differences between different sites, but you have not explicitly measured these differences)

However - random effects can often be included as fixed effects in a model. For example, if you collect data on a species’ abundance over 10 years from six different sites, you might want to include site as a random effect (to look at the trend in this species over time), OR as a fixed effect to see whether there are significant differences between the sites.

1.3.2.1 So how do we choose?

This is often more of a philosophical question than a scientific one, but generally:

  • What are we interested in?
    • We can’t calculate p values for random effects
  • Do we only have a small amount of data?
    • Fitting a factor as a fixed effect uses up a lot of degrees of freedom
    • If we use up lots of degrees of freedom we need lots of response data (i.e. your observed y variable)
  • Do you have enough levels in the data for it to be a random effect?
    • the consensus seems to be >5 ish
  • Random effects should NOT be continuous variables (e.g. temperature, year, time…)
    • R fits random effects as factors, which isn’t appropriate if your variable is continuous
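To make that concrete, here is a hypothetical sketch (the variable names abundance, temperature, site, and year are invented for illustration, using lme4/glmmTMB-style formula syntax):

```r
## temperature is measured and continuous -> a fixed effect
## site is an unmeasured grouping with a handful of levels -> a random intercept
abundance ~ temperature + (1 | site)

## NOT this: year is continuous, so treating it as a random effect
## would coerce it to a factor with one level per year
## abundance ~ temperature + (1 | year)
```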

1.3.2.2 A decision tree for random effects

The below is taken from the excellent blog Dynamic Ecology, and their post on random effects is worth a read (see here):

1.3.3 Crossed vs nested effects

So, you’ve decided you need to include random effects in your model. However, you still have one more choice to make. Are these random effects crossed or nested?

1.3.3.1 Nested random effects

Nested random effects allow us to specify that some random effects should be considered within a larger random effect.

An easy way to think about this is the scores of students in a class. Imagine 3 classes of students studying maths in 3 different schools:

We might expect there to be non-random changes in the variance in our data between different classes (assuming they have different teachers) and also between the different schools (some might be better funded than others, or have better reputations and thus attract smarter children).

In this instance it would be appropriate to nest the effect of Class within the larger effect of School.
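In lme4/glmmTMB-style formula syntax (a hypothetical sketch - score, school, and class are invented variable names), that nesting is written as:

```r
## a random intercept for each school, and for each class within each school
score ~ 1 + (1 | school/class)

## the / is shorthand for the equivalent explicit form:
score ~ 1 + (1 | school) + (1 | school:class)
```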

1.3.3.2 Crossed random effects

Crossed random effects occur where you have two or more variables which can create distinct groupings. For example, imagine we have three sites we collect data from, and two different observers collecting species diversity data. Both of the observers collect data at all of the sites.

In this example we might think that there are some systematic differences between the sites we want to account for, but also some systematic differences between the observers (some people might be better at spotting species than others).

However, here we don’t have any obvious nesting (both of the observers collect data at both sites). So in this case, you would want to specify crossed effects.
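In formula syntax, crossed random effects are simply two separate random intercept terms (a hypothetical sketch with invented variable names):

```r
## crossed random intercepts for site and observer - no nesting implied
diversity ~ 1 + (1 | site) + (1 | observer)
```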

Whether a random effect is nested or crossed can depend upon what scale you are looking at, so it isn’t necessarily trivial to decide whether to nest or cross your random effects (and a full discussion is beyond today’s work) but there are some resources to help you get your head around it at the bottom of this workbook.

2 Today’s practical

So today we are going to go through the process of fitting a GLMM (and see how it differs, or not, from GLMs), and learn some tools for assessing the fit of these models. This is a relatively quick intro rather than an in-depth dive - there is far too much to learn in 3 hours - but this will give you the skills to fit GLMMs and learn how to learn statistics in R.

2.0.1 Visualise the data

Time - 10 minutes

Today we are going to use the species and stressors database we used last week in the GLM session.

Again, we will add in a standardised time column to our data, starting at 0 weeks for the earliest recorded abundance for each population of each species:

## calculate standardised time per population and per species. 
long_spp <-  long_spp %>% 
                group_by(species, population) %>%
                mutate(standardised_time = as.numeric(difftime(as.Date(date), 
                                                               min(as.Date(date)), 
                                                               units="weeks")))

## just look at four of the columns to visualise the data 
## (you can look at all of them in RStudio) but printing all the columns in 
## Rmarkdown means we won't see the standardised_time column
print(long_spp[,c("species", "population", "abundance", "standardised_time")], n = 10)
## # A tibble: 696 × 4
## # Groups:   species, population [59]
##    species
##    <chr>  
##  1 Schist…
##  2 Schist…
##  3 Schist…
##  4 Schist…
##  5 Schist…
##  6 Schist…
##  7 Schist…
##  8 Parale…
##  9 Parale…
## 10 Parale…
## # … with 686 more rows, and 3 more variables: …

Task

There are a number of ways we could create the standardised time column (e.g. with week 0 set to the earliest recorded abundance of any species and population in the data, or creating the standardised time as we have done above where each species population has its own week 0). Each of these has different implications and makes different assumptions about what patterns we are looking for in the data. Try and identify what the implications of choosing these different standardized times would be.
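For comparison, a global week 0 (the earliest date anywhere in the data set, rather than one per population) could be computed like this - a sketch assuming the long_spp data frame and date column used above:

```r
## a single, global week 0: the earliest date across all species and populations
long_spp <- long_spp %>%
              ungroup() %>%
              mutate(global_time = as.numeric(difftime(as.Date(date),
                                                       min(as.Date(date)),
                                                       units = "weeks")))
```

With a global week 0 the populations keep their relative timing (useful if calendar effects matter); with a per-population week 0, as above, every time series starts at 0 (useful if time since first observation is what matters).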


In last week’s workbook I gave a possible solution for the species data visualisation task you were given:

## make a new column in long_spp which is the number of threats recorded for each population
## this is the number of non-NA values in each row of the columns primary_threat, secondary_threat, and tertiary_threat: 
long_spp <- long_spp %>% 
              rowwise() %>%
              mutate(n.threats = length(which(!is.na(c(primary_threat, secondary_threat, tertiary_threat)))))

##load in the scales package for the pseudo_log transformation
library(scales)

##build the ggplot, setting the date to a date format
ggplot(long_spp, aes(x=as.Date(date), 
                     y=abundance, 
                     col=n.threats))+
  ##set the theme to minimal
  theme_minimal() +
  ##position the legend at the top on the left
  theme(legend.position = "top", 
        legend.justification="left") +
  ## change the automatic label for the legend to the following
  ## note it is done twice (once for fill and once for col) as 
  ## we change the colour of the lines and the fill of the smooth according to the number of threats
  labs(col="Number of threats", fill="Number of threats") +
  ## add in the lines of the population, using the group aesthetic and the
  ## interaction() function to specify how the data are grouped and get 1 line per population 
  geom_line(aes(group = interaction(population, 
                                    species, 
                                    n.threats)), 
                alpha=0.2, 
                size=1) + 
  ## add in a smoothed line across all of the population time series for each number of threats
  ## this visualises the mean change across the populations
  geom_smooth(aes(group = n.threats, 
                  fill = n.threats), alpha=0.3, size=1) +
  ##change the colours to reflect the number of threats, using a gradient
  scale_color_gradient(low = "#745745", 
                       high = "#d14124") +
  scale_fill_gradient(low = "#745745", 
                      high = "#d14124") +
  ## transform the y axis using a pseudo_log transformation 
  ## (as we have some 0 counts and you can't log 0)
  scale_y_continuous(trans="pseudo_log") +
  ##change the x and y axis labels
  ylab("Population abundance") +
  xlab("Date")

We will set out to see - statistically - whether the number of stressors affecting a population has any effect on the rate of decline of those populations (as in the plot above), whilst accounting for the fact that these data cover different species.


Task

Implement the code (above) to calculate the number of stressors for each population of each species.


Great, so we now have a data frame with standardised times and the number of stressors added to it. Let’s first visualise the data:

## we will specify the x and y aesthetics here but specify the groups for the lines later
## this is because doing so means that the data for the geom_smooth() will then be all of the 
## data for each of the facets, but we can specify grouped data for the lines in geom_line
ggplot(long_spp, aes(x = standardised_time,
                     y = abundance)) + 
  ##add the lines. we can use the group aesthetic to specify
  ## groups within the data (in this case species and population)
  ## that we want to treat as separate data for the sake of plotting
  ## the `interaction()` function allows us to specify multiple levels to these data
  geom_line(aes(group = interaction(species, population))) + 
  facet_wrap(~n.threats) +
  theme_minimal() +
  ##fit a linear regression to the data to help visualise
  geom_smooth(method = "lm")


Task

Run the below code to see the range of trajectories of the different populations:

## I haven't run this, but here is an example of what happens when we group 
## in the initial ggplot() function rather than in the geom_line()
## try running it and seeing what happens
ggplot(long_spp, aes(x = standardised_time,
                     y = abundance,
                     group = interaction(species, population))) + 
  geom_line() + 
  facet_wrap(~n.threats) +
  theme_minimal() +
  geom_smooth(method = "lm")

So, eyeballing it, it looks like there might be some differences between the treatments (number of threats) - although if you run the second chunk of code above, where a linear model is fit to each time series, you will see that there is some variation in those rates of change through time.

2.0.2 Model structure

Time - 5 minutes

Ok, let’s think about the model structure. We know we have the following variables in the data which we might want to consider:

  1. abundance - this will be our response variable
  2. standardised_time - we want to understand whether the number of threats affects the populations through time, so we still need to include the time element
  3. n.threats - the number of threats
  4. species - looking at the above plots it seems like the species are potentially behaving in slightly different ways
  5. population - for some of the species we have multiple populations, which we might want to consider separately.

Great, so from this we can already start to think about a model structure based on what we are interested in. We know it’s going to be something like:

abundance ~ standardised_time + n.threats

But remember we want to see whether the number of threats affects how population abundance changes through time, so we are going to need to consider an interaction between standardised_time and n.threats. We specify an interaction in R using `:`:

abundance ~ standardised_time + n.threats + standardised_time:n.threats
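Note that R also has a shorthand for this: `a * b` expands to the two main effects plus their interaction, so an equivalent way to write the formula above is:

```r
## standardised_time * n.threats is shorthand for
## standardised_time + n.threats + standardised_time:n.threats
abundance ~ standardised_time * n.threats
```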

Great. Now let’s consider our other variables - species and population. We could consider species (and indeed population) as fixed effects in the model - if we were interested in them as part of the main part of our question. However, what we really want to know is what the effect of the number of threats is on abundance; we aren’t interested in the species effects per se.

This is where mixed models come to the fore - we want to look at the effects of standardised_time and n.threats whilst accounting for other variables (species & population) which might explain some variation in the data but whose effects we aren’t interested in.

There are a number of packages for fitting GLMMs in R - we are going to use the glmmTMB package because it’s really fast (so is great for big data sets), and has good online help/tutorials.


Task

Find, install, and load the glmmTMB package.



2.0.3 Coding mixed effects models

Time - 15 minutes

There are some differences between how GLMMs are coded between the different packages, but generally:

  • you specify random intercept effects using + (1 | group)
  • you specify random slope but fixed intercept using + (0 + x | group)
  • you specify random slope and intercept effects using + (x | group)

We are going to fit a random intercept model with intercepts for each of the species in our data using glmmTMB. i.e. we are assuming that the number of stressors has an equivalently negative effect on all species (the slopes are the same) but that the intercepts (the abundances of the populations) are different:

##fit a random effects model
m_mod1 <- glmmTMB(abundance ~ 
                    standardised_time + 
                    n.threats + 
                    standardised_time:n.threats + 
                    (1|species), 
                  data=long_spp, 
                  family="gaussian")

##look at the model summary:
summary(m_mod1)
##  Family: gaussian  ( identity )
## Formula:          
## abundance ~ standardised_time + n.threats + standardised_time:n.threats +  
##     (1 | species)
## Data: long_spp
## 
##      AIC      BIC   logLik deviance df.resid 
##   8002.8   8030.1  -3995.4   7990.8      690 
## 
## Random effects:
## 
## Conditional model:
##  Groups   Name        Variance Std.Dev.
##  species  (Intercept) 9602     97.99   
##  Residual             4447     66.69   
## Number of obs: 696, groups:  species, 51
## 
## Dispersion estimate for gaussian family (sigma^2): 4.45e+03 
## 
## Conditional model:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 255.547671  16.536993  15.453   <2e-16 ***
## standardised_time             0.038182   0.018306   2.086    0.037 *  
## n.threats                   -69.127191   5.394216 -12.815   <2e-16 ***
## standardised_time:n.threats  -0.091489   0.009579  -9.551   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So if we look at this output we can see the amount of variance the random effect (species) accounts for in our model, as well as the residual variance. Remember that we are assuming that species (the random effect) accounts for the structure in the data not explained by our fixed effects (standardised_time, n.threats, and their interaction).

We can calculate how much of the variance the random effect accounts for as the variance of the random effect divided by the sum of the random effect variance and residual variance:


Task

Calculate the variance explained by the random effect


9602 / (9602+4447)
## [1] 0.683465

So about 68%.
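This proportion is known as the intraclass correlation coefficient (ICC). If you prefer not to read the variances off the summary by hand, the performance package (an extra package, not loaded above, so install it first) can compute it directly from the fitted model:

```r
## compute the ICC for the random effect in m_mod1
library(performance)
icc(m_mod1)
```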

However, whilst we expect there to be differences in the intercept between the species (as coded before) there is also a good reason to think that there might be differences between populations of the same species as well.

2.0.4 Nested random effects

Time - 5 minutes

In our case we don’t want to consider species and population separately because:

  1. we expect that there are larger differences between the different species than between the populations
  2. the differences between populations should be considered within the umbrella effects of the species, as they are all individuals of the same species and so we expect that the populations will behave in similar ways
  3. not all of the species have multiple populations

So back to our example, we want to nest population inside species, and do this as follows:

##fit a random effects model with nested random effects
m_mod2 <- glmmTMB(abundance ~ standardised_time + 
                    n.threats + 
                    standardised_time:n.threats + 
                    (1|species/population), 
                  data=long_spp, 
                  family="gaussian")

##look at the model summary:
summary(m_mod2)
##  Family: gaussian  ( identity )
## Formula:          
## abundance ~ standardised_time + n.threats + standardised_time:n.threats +  
##     (1 | species/population)
## Data: long_spp
## 
##      AIC      BIC   logLik deviance df.resid 
##   7911.9   7943.7  -3949.0   7897.9      689 
## 
## Random effects:
## 
## Conditional model:
##  Groups             Name        Variance  Std.Dev. 
##  population:species (Intercept) 8.988e+03 9.481e+01
##  species            (Intercept) 5.554e-93 7.452e-47
##  Residual                       3.755e+03 6.128e+01
## Number of obs: 696, groups:  population:species, 59; species, 51
## 
## Dispersion estimate for gaussian family (sigma^2): 3.76e+03 
## 
## Conditional model:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 208.55777   18.16210  11.483  < 2e-16 ***
## standardised_time             0.07329    0.01797   4.079 4.53e-05 ***
## n.threats                   -33.47467    8.30603  -4.030 5.57e-05 ***
## standardised_time:n.threats  -0.11393    0.00934 -12.198  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you are working on Windows you may find that you are getting NaNs in your summary table. This is (basically) because there are some slight differences in how R fits random effects in MacOS vs Windows, and is not limited to this package (e.g. see here). One way to get around this is by implementing a different maximum likelihood optimizer. Essentially, when you fit a model to data you are searching for the parameter values which make the model fit your data best. When you have lots of variables you have lots and lots and lots of parameters to search through. It would be impossible to look at them all, so glmmTMB uses an optimiser - a short cut to try and find what the “best” parameter values are. There are different optimisers, which search for these “best” parameters in different ways. We can change to a different one using:

##fit a random effects model with nested random effects
m_mod2 <- glmmTMB(abundance ~ standardised_time + 
                    n.threats + 
                    standardised_time:n.threats + 
                    (1|species/population), 
                  data=long_spp, 
                  family="gaussian",
                  ##the control parameter specifying the optimizer to use:
                  control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))

Now we can see that we have a new term (population:species) in our “conditional model” and again we can check how much variance is explained by these random effects.


Task

Calculate the variance explained by these random effects


(8.988e+03+5.554e-93)/(8.988e+03+5.554e-93+3.755e+03)
## [1] 0.7053284

Not much more than just species by itself. HOWEVER, we know a priori that population should nest within species (because we expect individuals of the same species to have e.g. similar reproductive rates, behaviours, mortality rates etc.) and so we should include this in the model, regardless of whether it explains more of the variance (although we would expect it to).
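If you want a more formal comparison of the two random effect structures, you can compare the models by AIC (assuming m_mod1 and m_mod2 from above are still in your workspace; lower is better):

```r
## compare the species-only and nested random effect models
AIC(m_mod1, m_mod2)
```

From the summaries above, the nested model has the lower AIC (7911.9 vs 8002.8), consistent with it fitting better.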

2.0.5 Crossed random effects

As we discussed earlier, you can also specify crossed random effects, where two or more variables can create distinct groupings. Doing this is simple:

y ~ x1 + x2 + (1|z1) + (1|z2)

Where z1 is one random effect, and z2 is another.

However, we will stick with our nested model design as there isn’t any variable in the data which would suggest that we should include a crossed design.

2.0.6 Assessing GLMM model fits

Time - 30 minutes

So, our model is fairly simple:

abundance ~ standardised_time + n.threats + standardised_time:n.threats + (1|species/population)

So far we have considered what random effects should be in the model, but haven’t actually looked at the model fit! The basic approach of plotting a QQ plot we did above doesn’t mathematically work well for GLMMs; however, there are some handy tools in the DHARMa package which can help us assess the model fit more robustly. These work by using the model fit to simulate residuals which incorporate the uncertainty in the model - see ?simulateResiduals for a good explanation.

##load DHARMa
library(DHARMa)

## simulate the residuals from the model
##setting the number of sims to 1000 (more is better, according to the DHARMa help file)
m_mod2_sim <- simulateResiduals(m_mod2, n = 1000)

##plot out the residuals
plot(m_mod2_sim)

Ok so what do these plots tell us? Well on the left is a version of the QQ plot we saw earlier, and we read it in exactly the same way as a normal QQ plot. We also get some statistical tests to see how well our data fit the distribution - if we have a significant (p<0.05) test score then our data are significantly DIFFERENT from the distribution specified in the model - you can see here that this is the case.

On the right we have a version of the residuals vs predicted plot - we are looking for solid red lines which are close to horizontal (actually close to tracking the dashed red horizontal lines at 0.25, 0.5, and 0.75 - the quartiles of the data). Again, we can see our data are a poor fit and we get a warning about the quantile deviations in the title.

Conclusion? Our model is doing a poor job of fitting to the data.


Task

Head back to the online qqplot visualisations and play with the sliders and see if you can determine what sort of skew in the data we are dealing with.


So looking at the online visualisation I would say we are looking at long-tailed, positively skewed data when compared to a normal distribution. Let’s see if we can solve this.

This is where understanding our data is important - why are we seeing these patterns? Well, one reason is because of this:

We have time series which go extinct, and then we have a series of 0’s recorded until the end of the time series (plotted as points in the graphic above).


Task

Consider whether we should include these 0’s in our analysis.








Well, let’s think about fitting a linear regression to the data above, where we either (1) include the zeros (red line in the plot below) or (2) don’t include them (yellow line):

You can see that - quite obviously - there is a big difference in our estimates of the rate of change of these populations over time and I hope that it should be obvious that we don’t want to include all of the 0’s in the analysis.

So do we just remove all of the 0s?

Well…no, because the first 0 (in chronological order) still gives us valid and important information on how fast the population is declining, so we should just exclude all the rest of the 0’s after that.

Hypothetical question

Should we include the zeros if we have a non-zero abundance at the end of the time series (rather than the populations going extinct as in our time series)?

So how do we go about doing this?

Well, first we need to write a function to return the data we want (only one 0). Luckily our data finish with 0, 0, 0, 0 etc and don’t have something like 0, 0, 1, 0, 0 (which would make things trickier to code). This means we can tell R to return the data up to the first instance of a 0 being present:

##a function to return data up to and including the first 0 count:
single_zero <- function(x, group_info){
  if(nrow(x)>0){
    x <- x[ order(x$standardised_time) ,]
    if( min(x$abundance) == 0){
      return( x[1:min(which(x$abundance == 0)),] )
    }else{return(as_tibble(x))}
  }
}

Task

See if you can work out what the above function does. I suggest running it on a small subset of the full data, e.g. by using the [[2]] approach above to select out a data frame from the split_spp list. Hint - the group_info section in function(x, group_info) is necessary for the group_map() function to work, so you can ignore this.
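If you don’t have the split_spp list to hand, another way to test the function on a single population is to filter one out of long_spp (a sketch, assuming the long_spp data frame and single_zero() function from above):

```r
## pull out a single species/population and run single_zero() on it
one_pop <- long_spp %>% 
             ungroup() %>% 
             filter(species == species[1], population == population[1])
single_zero(one_pop)
```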


Ok, so now we have a function (and you have checked it works using a subset of the data), we can apply it to each species/population group in our data. The tidyverse provides tools to do that - namely the group_map() function:

## make a new data frame
species_single0 <- long_spp %>% 
                      ##we want to cut the long_spp data up by the values in species and population
                      group_by(species, population) %>% 
                      ##the group_map function allows us to apply a function to these groups
                      ##and we want to keep the grouping variables, hence keep = T
                      group_map(single_zero, .keep = T) %>% 
                      ##and then bind it all back together again to a tibble
                      ##otherwise it is returned as a list of the group data
                      bind_rows()

Let’s check whether this has worked by looking at the data we plotted earlier

## # A tibble: 9 × 3
##   species            standardised_time abundance
##   <chr>                          <dbl>     <dbl>
## 1 Acaulon triquetrum               0         189
## 2 Acaulon triquetrum              52.3       114
## 3 Acaulon triquetrum             104.         88
## 4 Acaulon triquetrum             157.         36
## 5 Acaulon triquetrum             209.          3
## 6 Acaulon triquetrum             261           1
## 7 Acaulon triquetrum             313.          1
## 8 Acaulon triquetrum             365.          1
## 9 Acaulon triquetrum             417.          0

Great! It ends in a single 0. We can continue.
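To check this worked for every population (not just the one printed above), you can count the zeros remaining in each time series - a sketch using the species_single0 data frame created above; every group should now contain at most one zero:

```r
## count the zeros left in each population's time series
species_single0 %>% 
  group_by(species, population) %>% 
  summarise(n_zeros = sum(abundance == 0), .groups = "drop") %>% 
  filter(n_zeros > 1)   ## should return no rows
```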

So our model didn’t fit well earlier, but it might fit better now we have removed the data that shouldn’t have been there in the first place! So, let’s do some exploration.

So, we could fit a GLMM with a log link (as this is what worked best in the first example on a single species); however, because we have 0 values in the data (and you can’t log a zero) we would need to add one to the data. Additionally, we are using count data, so really the gaussian distribution isn’t a good option (unless the mean of the data is high, as it was when we fit the model to the single time series, but isn’t now we are dealing with values which decline down to 0).

So, what other options do we have? Well, count data are a very common kind of data to have, so luckily there is a raft of tools available for them. The following are some of the most commonly used count data distributions which can be fit in glmmTMB (see here):

  • poisson
  • generalised poisson
  • negative binomial (there are two different parameterizations of this available)
    • nbinom1
    • nbinom2

There are also truncated versions of many of the above, but if you plot a histogram of our data then it doesn’t look like our data are truncated:
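A quick way to check is a simple histogram (a sketch using the species_single0 data frame from above):

```r
## plot the distribution of the abundance counts
hist(species_single0$abundance, breaks = 50, xlab = "Abundance", main = "")
```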

This is a fairly baffling array of options, but there are some tools we can use to help us assess these data.

Note - all of the below distribution identification techniques work well (often better) for GLMs.

The fitdistrplus library allows us to compare our data to distributions fit to it:

##load the fitdistrplus package
library(fitdistrplus)    

##fit a poisson distribution to our data:
fit_pois <- fitdist(species_single0$abundance, 
                    distr = "pois")

This throws up some warnings. After some digging these are because we have some zero frequencies in our data (we can ignore this).

##plot the data
plot(fit_pois)

In the first plot (on the left) the red density shows the EXPECTED distribution of the data if they were from a poisson distribution, whilst the actual data are shown by the thin black bars - what we are looking for is the bars following the same distribution as the red area. You can instantly see that they aren’t the same, and this is backed up by the right hand plot (the cumulative density function) where the black and red lines should mimic each other.

##look at the summary statistics
gofstat(fit_pois)
## Chi-squared statistic:  Inf 
## Degree of freedom of the Chi-squared distribution:  19 
## Chi-squared p-value:  0 
##    the p-value may be wrong with some theoretical counts < 5  
## Chi-squared table:
##           obscounts   theocounts
## <= 1   2.800000e+01 6.081900e-57
## <= 5   3.300000e+01 2.049747e-50
## <= 11  3.100000e+01 5.019744e-43
## <= 19  3.100000e+01 2.708704e-35
## <= 30  2.700000e+01 5.877842e-27
## <= 41  2.900000e+01 2.226201e-20
## <= 55  2.700000e+01 8.179528e-14
## <= 68  2.800000e+01 4.194862e-09
## <= 81  2.700000e+01 1.849605e-05
## <= 92  2.800000e+01 4.473910e-03
## <= 102 2.700000e+01 2.142684e-01
## <= 121 3.000000e+01 2.940196e+01
## <= 145 3.200000e+01 3.654994e+02
## <= 162 2.900000e+01 1.832680e+02
## <= 187 2.700000e+01 2.156040e+01
## <= 222 2.800000e+01 5.147652e-02
## <= 249 2.700000e+01 6.213563e-08
## <= 293 2.700000e+01 6.661338e-14
## <= 356 2.700000e+01 0.000000e+00
## <= 436 2.700000e+01 0.000000e+00
## > 436  3.000000e+01 0.000000e+00
## 
## Goodness-of-fit criteria
##                                1-mle-pois
## Akaike's Information Criterion   82929.96
## Bayesian Information Criterion   82934.35

This is further backed up by the p-value from the goodness-of-fit test for the poisson distribution - a significant p-value here denotes a significant difference from the distribution we are testing the data against (poisson).

Conclusion - the poisson distribution is a bad fit for our data.

What about the negative binomial distribution? This is a great (and more flexible) option for fitting to count data:

##fit a nbinom to the data instead
fit_nbinom <- fitdist(species_single0$abundance, 
                      distr = "nbinom")
##again we get warnings from those zero frequencies, which we can ignore

##plot it out:
plot(fit_nbinom)

That looks a lot better! We are still underpredicting throughout the distribution, but the CDF looks good.

##the goodness of fit
gofstat(fit_nbinom)
## Chi-squared statistic:  31.67363 
## Degree of freedom of the Chi-squared distribution:  18 
## Chi-squared p-value:  0.02402514 
## Chi-squared table:
##        obscounts theocounts
## <= 1    28.00000   19.97953
## <= 5    33.00000   26.85885
## <= 11   31.00000   31.87077
## <= 19   31.00000   35.46763
## <= 30   27.00000   41.34348
## <= 41   29.00000   35.67087
## <= 55   27.00000   39.52625
## <= 68   28.00000   32.15955
## <= 81   27.00000   28.64267
## <= 92   28.00000   21.91913
## <= 102  27.00000   18.32195
## <= 121  30.00000   31.17071
## <= 145  32.00000   33.64150
## <= 162  29.00000   20.60556
## <= 187  27.00000   26.26787
## <= 222  28.00000   30.16536
## <= 249  27.00000   19.02378
## <= 293  27.00000   24.82173
## <= 356  27.00000   25.61594
## <= 436  27.00000   21.19383
## > 436   30.00000   35.73303
## 
## Goodness-of-fit criteria
##                                1-mle-nbinom
## Akaike's Information Criterion     7114.463
## Bayesian Information Criterion     7123.257

But we still have a significant difference from the nbinom distribution (p<0.05). However, remember that above we are looking at the distribution of the WHOLE data, and what we actually need is for the residuals to fit the model assumptions. So the above is SUGGESTING that the poisson is unlikely to be a good fit, and that the negative binomial might be better. Better still, we might consider a zero-inflated negative binomial - which can account for a larger number of 0s than expected under the negative binomial distribution. So we might like to consider the following distributions for our model:

  1. poisson (unlikely to fit well)
  2. negative binomial
  3. zero inflated negative binomial

Let’s start by fitting the poisson model as a starting point. We are expecting this to fit the data poorly, but we can try it anyway. We are going to use scaled predictor variables.

2.0.7 Scaling data

Just as a quick aside - we often scale() data (so the data have a mean of zero (“centering”) and a standard deviation of one (“scaling”)) when the continuous predictor variables we are using are on very different scales. In our case we have one (number of threats) which is between 0 and 3 and one (standardised time) which is between 0 and 991. This can cause issues with the model fitting, as the coefficients for one of your predictors might be vanishingly small (and therefore hard to estimate) when compared to your other predictors.
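As a quick illustration of what scale() does (using a toy vector rather than our workshop data), note that the original mean and standard deviation are stored as attributes on the result - which is what lets us reverse the transformation later on:

##a toy vector to demonstrate scale()
x <- c(2, 4, 6, 8)

##centre (subtract the mean) and scale (divide by the standard deviation)
x_scaled <- scale(x)

##the original mean and sd are stored as attributes:
attr(x_scaled, "scaled:center")  ##the mean of x (5)
attr(x_scaled, "scaled:scale")   ##the sd of x (~2.58)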

2.0.8 Fitting a poisson model

Time - 10 minutes

So here we fit our model, with scaled parameters, and population nested within species as our random effects:

## fit a poisson model
ps_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1 | species/population), 
                  data = species_single0,
                  family="poisson")
##summarise the model output
summary(ps_mod)
##  Family: poisson  ( log )
## Formula:          
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |  
##     species/population)
## Data: species_single0
## 
##      AIC      BIC   logLik deviance df.resid 
##  13277.4  13303.8  -6632.7  13265.4      594 
## 
## Random effects:
## 
## Conditional model:
##  Groups             Name        Variance  Std.Dev. 
##  population:species (Intercept) 1.409e+00 1.1870115
##  species            (Intercept) 4.827e-07 0.0006948
## Number of obs: 600, groups:  population:species, 59; species, 51
## 
## Conditional model:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                4.158693   0.156990   26.49  < 2e-16
## scale(standardised_time)                  -0.547610   0.006039  -90.68  < 2e-16
## scale(n.threats)                          -0.995613   0.150798   -6.60 4.05e-11
## scale(standardised_time):scale(n.threats) -0.558804   0.005773  -96.80  < 2e-16
##                                              
## (Intercept)                               ***
## scale(standardised_time)                  ***
## scale(n.threats)                          ***
## scale(standardised_time):scale(n.threats) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we wanted to look at how well a linear regression fits the data we would tend to use r-squared as a goodness-of-fit measure for our model (an r2 of 1 means the model exactly fits the data).

This measure falls apart when we consider GLMs, and even more so when we consider GLMMs, but we can calculate a simple pseudo r2 value to help us roughly estimate the goodness of fit. There are a number of simple and complex ways to do this, but we will just implement the following:

##function to calculate a pseudo R2 value for our GLMM
r2.corr.mer <- function(m) {
   lmfit <-  lm(model.response(model.frame(m)) ~ fitted(m))
   summary(lmfit)$r.squared
}

## apply it to our poisson model
r2.corr.mer(ps_mod)
## [1] 0.8672486

0.867 is a good r2. There are also more complex methods, e.g. that available in the MuMIn package:

##r squared calculated by MuMIn:
MuMIn::r.squaredGLMM(ps_mod)
##                 R2m       R2c
## delta     0.5182369 0.9981549
## lognormal 0.5182394 0.9981599
## trigamma  0.5182343 0.9981499

These values are lower - another reminder that different pseudo-r2 methods can disagree.

We can also use a simple function to look at overdispersion in the model:

##function to calculate overdispersion 
overdisp_fun <- function(model) {
    rdf <- df.residual(model)
    rp <- residuals(model,type="pearson")
    Pearson.chisq <- sum(rp^2)
    prat <- Pearson.chisq/rdf
    pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
    c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}

##apply to our data
overdisp_fun(ps_mod)
##      chisq      ratio        rdf          p 
## 9357.20287   15.75287  594.00000    0.00000

The data are considered overdispersed when the chisq value is > the rdf value - i.e. when the ratio is > 1. Clearly these data are overdispersed!

Note - overdispersion is an issue only for distributions which don’t estimate a scale parameter (i.e. poisson and binomial).

Clearly this model doesn’t fit the data well (even if the r2 is quite high - this is why we need to look at multiple methods to assess the fit of our models, and why r2 is a poor guide for GLMs and GLMMs!).

2.0.9 Fitting alternative models

Time - 15 minutes

So let’s fit some other potential models to the data. We can update() the model structure so we don’t have to keep typing out the full model again:

###try with nbinom: 
nb1_mod <- update(ps_mod, family="nbinom1")

##and the second parameterisation of nb
nb2_mod <- update(ps_mod, family="nbinom2")

We should also consider zero-inflated versions of the NB distributions, as we know that we have 0s in our data. glmmTMB allows us to assume that all of the data are equally likely to be zero inflated (using the ziformula = ~1 argument), or alternatively to specify structures in the data which might explain the excess zeros. In our case, if we look at our data plotted out above, we can see (and logically this makes sense) that the number of threats is likely to be a key driver of the number of zeros, as it seems to determine whether species are declining or not. So we can specify the number of threats as the potential driver of zeros in our data:

##zero inflated version of these with zeros being attributed to number of threats:
Zi_nb1_mod <- update(nb1_mod, ziformula = ~n.threats)
Zi_nb2_mod <- update(nb2_mod, ziformula = ~n.threats)

##and we can also look at the zero inflated version of the poisson model:
Zi_ps_mod <- update(ps_mod, ziformula = ~n.threats)

Ok, so let’s compare the fits of these models. We can do this using the anova() (analysis of variance) function, which gives us additional information on top of what AIC() returns:

##compare the models
anova(ps_mod, 
      Zi_ps_mod,
      nb1_mod, 
      nb2_mod,
      Zi_nb1_mod,
      Zi_nb2_mod)

We can see that the zero-inflated NB1 model (Zi_nb1_mod) seems to fit the best of the models tested (it has the lowest AIC), and if we look at the pseudo-r2 value we can see that it is still fitting well:


Task

Use the r2.corr.mer() function to calculate the pseudo r2 value for the Zi_nb1_mod.


##pseudo r2 value
r2.corr.mer(Zi_nb1_mod)
## [1] 0.8599625

Sadly the MuMIn::r.squaredGLMM approach we tried earlier doesn’t (yet) handle zero-inflated versions of the NB in glmmTMB; however, this model is looking like a pretty good fit for our data.

2.0.10 Temporal correlations in data

Time - 5 minutes

One other thing we might want to consider is correlation structures in our data. In this case we are fitting models to time series, and because the value of any given time point depends on the value of the previous time point, we might need to consider some underlying structure (autocorrelation) in the data we are analysing. We can implement this in glmmTMB:

##fit a zero inflated negative binomial model with autocorrelation structure defined by the time variable
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|species/population) + ar1(factor(scale(standardised_time)) - 1|species/population), 
                  data = species_single0,
                  ziformula=~n.threats,
                  family="nbinom1")
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
## Warning in fitTMB(TMBStruc): Model convergence problem; false convergence (8).
## See vignette('troubleshooting')

Note - we have to include the predictor for ar1() as a factor (that’s just the way that glmmTMB needs it to be, don’t ask!), hence the factor(scale(standardised_time)).

Trying to fit this throws up a number of warnings/errors. If you run vignette('troubleshooting') you get some very useful help on these warnings. We can solve the main one (false convergence) fairly easily, by implementing a different maximum likelihood optimizer:

##fit a zero inflated negative binomial model with autocorrelation structure defined by the time variable
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|species/population) + ar1(factor(scale(standardised_time))-1|species/population), 
                  data = species_single0,
                  ziformula=~n.threats,
                  family="nbinom1",
                  ##the control parameter specifying the optimizer to use:
                  control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')

which solves one of the issues, but not:

Model convergence problem; non-positive-definite Hessian matrix

The troubleshooting vignette gives the following as probable reasons for this warning:

  1. when a model is overparameterized (i.e. the data does not contain enough information to estimate the parameters reliably)
  2. when a random-effect variance is estimated to be zero, or random-effect terms are estimated to be perfectly correlated (“singular fit”: often caused by having too many levels of the random-effect grouping variable)
  3. when zero-inflation is estimated to be near zero (a strongly negative zero-inflation parameter)
  4. when dispersion is estimated to be near zero
  5. when complete separation occurs in a binomial model: some categories in the model contain proportions that are either all 0 or all 1

If we look at the fixed effects of the model:

##show the fixed effects
fixef(Zi_nb1_ar1_mod)
## 
## Conditional model:
##                               (Intercept)  
##                                    4.2298  
##                  scale(standardised_time)  
##                                   -0.6811  
##                          scale(n.threats)  
##                                   -0.8801  
## scale(standardised_time):scale(n.threats)  
##                                   -0.6365  
## 
## Zero-inflation model:
## (Intercept)    n.threats  
##      -7.539        1.557

You can see that the zero inflation parameters (number 3) are not near zero. We also don’t have a dispersion parameter in the model (4) and we aren’t fitting a binomial model (5). Which leaves us with (1) and (2). If we look at the random effect sizes from the zero inflated negative binomial model without autocorrelation:

##look at the model
Zi_nb1_mod
## Formula:          
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |  
##     species/population)
## Zero inflation:             ~n.threats
## Data: species_single0
##       AIC       BIC    logLik  df.resid 
##  6041.627  6081.199 -3011.813       591 
## Random-effects (co)variances:
## 
## Conditional model:
##  Groups             Name        Std.Dev.
##  population:species (Intercept) 0.707994
##  species            (Intercept) 0.000331
## 
## Number of obs: 600 / Conditional model: population:species, 59; species, 51
## 
## Dispersion parameter for nbinom1 family (): 15.2 
## 
## Fixed Effects:
## 
## Conditional model:
##                               (Intercept)  
##                                    4.4345  
##                  scale(standardised_time)  
##                                   -0.4922  
##                          scale(n.threats)  
##                                   -0.7201  
## scale(standardised_time):scale(n.threats)  
##                                   -0.5032  
## 
## Zero-inflation model:
## (Intercept)    n.threats  
##      -8.471        1.810

We can see that the random effect of species is quite small, so we could consider removing this from the model with autocorrelation and trying it again:

##zero inflated negative binomial model with autocorrelation and population as random effects
Zi_nb1_ar1_mod <- glmmTMB(abundance ~ scale(standardised_time) * scale(n.threats) + (1|population) + ar1(factor(scale(standardised_time))-1|population), 
                  data = species_single0,
                  ziformula=~n.threats,
                  family="nbinom1",
                  ##the control parameter specifying the optimizer to use:
                  control=glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")))
## Warning in fitTMB(TMBStruc): Model convergence problem; extreme or very small
## eigenvalues detected. See vignette('troubleshooting')

However, you can see we are still getting convergence warnings, so we will conclude that in this case the model is probably overparameterised (1). So let’s fall back to our previous best model - Zi_nb1_mod. If we thought there was likely to be significant autocorrelation structure then we would need to think about going back and simplifying our original model.

2.0.11 Model diagnostics with DHARMa

Time - 5 minutes

The DHARMa package allows us to implement some great diagnostic tools to check our model fits (these are only defined for some distributions at the moment; luckily the zero-inflated NB is one of them):

##load DHARMa
library(DHARMa)

## simulate the residuals from the model
##setting the number of sims to 1000 (more is better, according to the DHARMa help file)
Zi_nb1_mod_sim <- simulateResiduals(Zi_nb1_mod, n = 1000)

##plot out the residuals
plot(Zi_nb1_mod_sim)

We can interpret the QQ plot as normal - and it looks good! We have no significant values for any of the tests (which would indicate that the distribution is a poor fit to the data).

On the right we have a plot of the predictions vs residuals. Here we are hoping to see straight solid red lines across the plot which match the dashed red lines, but as you can see there is some deviation from that, and we are getting the warning that quantile deviations were detected in the residuals. This isn’t ideal, but visually it doesn’t appear very bad in this case. Ideally we would delve into this some more (see below in More possible models). There is an excellent (and long) vignette for DHARMa available here which you can delve into.

2.0.12 Additional DHARMa tests

Time - 10 minutes

DHARMa also provides a suite of other test functions which allow us to run diagnostics on the simulated outputs from simulateResiduals():

## test to see whether there are outliers; in our case not significant so we don't need to worry
testOutliers(Zi_nb1_mod_sim, 
             plot = TRUE)

## 
##  DHARMa outlier test based on exact binomial test with approximate
##  expectations
## 
## data:  Zi_nb1_mod_sim
## outliers at both margin(s) = 0, observations = 600, p-value = 0.6382
## alternative hypothesis: true probability of success is not equal to 0.001998002
## 95 percent confidence interval:
##  0.000000000 0.006129271
## sample estimates:
## frequency of outliers (expected: 0.001998001998002 ) 
##                                                    0
## tests if the simulated dispersion is equal to the observed dispersion,
## again not significant so no need to worry
testDispersion(Zi_nb1_mod_sim, 
               plot = TRUE)

## 
##  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
##  simulated
## 
## data:  simulationOutput
## dispersion = 0.55531, p-value = 0.39
## alternative hypothesis: two.sided
## compares the distribution of expected zeros in the data against the observed zeros
## this is right on the borderline of being significant, suggesting there might be a better structure for
## our zero inflation parameter (remember we used ~n.threats). That might be worth looking into further
testZeroInflation(Zi_nb1_mod_sim, 
                  plot = TRUE)

## 
##  DHARMa zero-inflation test via comparison to expected zeros with
##  simulation under H0 = fitted model
## 
## data:  simulationOutput
## ratioObsSim = 0.57281, p-value = 0.05
## alternative hypothesis: two.sided
## see if there is temporal autocorrelation in the residuals
## not significant, so it turns out we didn't need to try and fit the autocorrelation model earlier on!
## note - if your time values are repeated across groups you may first need to
## aggregate the residuals with DHARMa's recalculateResiduals() function
testTemporalAutocorrelation(Zi_nb1_mod_sim,
                             time = species_single0$standardised_time,
                              plot = TRUE) 

## 
##  Durbin-Watson test
## 
## data:  simulationOutput$scaledResiduals ~ 1
## DW = 2.0287, p-value = 0.7248
## alternative hypothesis: true autocorrelation is not 0

So overall it seems like our model is a pretty good fit to the observed data.

2.0.13 What does it mean when a model doesn’t fit properly?

Time - 5 minutes

It might be worth briefly considering what the implications of a poorly fitting model are, and how much of an issue it is. Well, this really depends upon the questions you are interested in, and what you want to do with your model.

In our case, we are interested in whether there is a significant effect of the number of threats on how population abundances change through time. We aren’t necessarily interested in the exact values of the coefficients (whether the effect of an additional stressor is -0.10224 or -0.10890 doesn’t make a big difference to us). We do want to be sure that the patterns we are seeing in the model are reflected in the data, and the fact that they are highly significant and the effect sizes are large means we can be quite confident in this even if the model fit isn’t “perfect”.

However, if we have marginal effects with small effect sizes then we are likely to be more concerned about how well our model fits to be sure whether the effects we are reporting are in fact observed in the data, rather than a function of a poor model fit.

Likewise, if we are fitting a model to data and then using that model to - say - parameterise a simulation, or project the trends beyond the data we have collected, then we are again going to be much more concerned about how well our model fits, as any errors in the coefficient estimates will be propagated as we use those estimates in other analyses. In this case we might want to fit something more flexible to our data, like a Generalised Additive Model (GAM) or a Generalised Additive Mixed Model (GAMM). These can deal well with non-linear effects, and thus make reliable predictions within the bounds of the data you fit the model to, but they are difficult to interpret when trying to identify the effects of fixed effects, and cannot be used to extrapolate beyond the data you have. You can see the flexibility of GAMs below (purple line) vs a linear model (yellow line).
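As a rough sketch of what fitting a GAMM to these data might look like, using the mgcv package (illustrative only - we haven’t fitted or assessed this model here):

##load the mgcv package for fitting GAMs/GAMMs
library(mgcv)

##a GAMM with a smooth term for time, a negative binomial family,
##and population as a random effect (population needs to be a factor
##for bs = "re") - illustrative only, not assessed here!
gam_mod <- gam(abundance ~ s(standardised_time) + n.threats +
                 s(population, bs = "re"),
               family = nb(),
               data = species_single0)

The s() terms are the smoothers which give GAMs their flexibility, and bs = "re" specifies a random-effect “smoother” for population.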

In our case, we will continue with the best model we identified above and look at interpreting and plotting the outputs.

2.0.14 Final visual checks

Time - 5 minutes

As you can see, it has taken a long time to get here, and if your data are very complex, or don’t conform to a well-used distribution, this process takes even longer. I like to do one final easy check of how well the model is working, which is simply to compare the model predictions to the observed values. We will go back to our old friend the predict() function for this:

## add in the predicted values from the model:
species_single0$predicted <- predict(Zi_nb1_mod, 
                                     newdata = species_single0, 
                                     type = "response")

##plot the predicted against the observed
ggplot(species_single0, aes(x = abundance, 
                            y = predicted)) + 
  geom_point(col="grey") + 
  geom_abline(slope = 1) +
  theme_minimal() +
  xlab("Observed") +
  ylab("Predicted")

The points should then fall close to the 1:1 line, and you can see our model does a pretty good job. We can see that it both underpredicts and overpredicts the values in some places, but we expect that (we won’t ever have a model which fits perfectly; after all, the whole point of a model is to simplify the patterns in data so we can interpret them), and there aren’t any clear patterns (like only underpredicting at high observed values).

Phew! All that effort was worth it!

There are of course a lot more ways to go about visualising models, and Ben Bolker gives a much more detailed deep dive into diagnostic plotting here, which I highly recommend you look through.

2.0.15 More possible models

We said before that we might want to consider a more complex zero-inflation structure than the one we currently have, and we may also want to think about explicitly modelling a dispersion parameter (although our tests above don’t highlight this as a particular issue, it’s worth knowing this is possible), e.g. n.threats might be a predictor of increased dispersion in the data. glmmTMB allows us to do this, but we aren’t going to dig into that today. However, if we weren’t confident about the model fits (low pseudo r2, poor predicted vs observed plots, etc), or wanted to check some other model structures, that would be a good place to start.
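For example, glmmTMB’s dispformula argument lets us model the dispersion directly, and we could also enrich the zero-inflation formula. The following is just a sketch - we haven’t fitted or assessed these models here:

##model the dispersion as a function of the number of threats
##(illustrative only - not fitted or assessed in this workshop)
disp_mod <- update(Zi_nb1_mod, dispformula = ~n.threats)

##a more complex zero-inflation structure, e.g. adding time as a predictor of zeros
zi2_mod <- update(Zi_nb1_mod, ziformula = ~n.threats + scale(standardised_time))

These could then be compared to our existing models using anova(), as before.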

2.0.16 What does our model ACTUALLY SAY?

Time - 2 minutes

So finally, what does our model say?


Task

Print the summary of the Zi_nb1_mod.

##summarise the model 
summary(Zi_nb1_mod)
##  Family: nbinom1  ( log )
## Formula:          
## abundance ~ scale(standardised_time) * scale(n.threats) + (1 |  
##     species/population)
## Zero inflation:             ~n.threats
## Data: species_single0
## 
##      AIC      BIC   logLik deviance df.resid 
##   6041.6   6081.2  -3011.8   6023.6      591 
## 
## Random effects:
## 
## Conditional model:
##  Groups             Name        Variance  Std.Dev.
##  population:species (Intercept) 5.013e-01 0.707994
##  species            (Intercept) 1.095e-07 0.000331
## Number of obs: 600, groups:  population:species, 59; species, 51
## 
## Dispersion parameter for nbinom1 family (): 15.2 
## 
## Conditional model:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                4.43446    0.09774   45.37  < 2e-16
## scale(standardised_time)                  -0.49215    0.02231  -22.06  < 2e-16
## scale(n.threats)                          -0.72013    0.09532   -7.55 4.21e-14
## scale(standardised_time):scale(n.threats) -0.50318    0.02139  -23.52  < 2e-16
##                                              
## (Intercept)                               ***
## scale(standardised_time)                  ***
## scale(n.threats)                          ***
## scale(standardised_time):scale(n.threats) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Zero-inflation model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.4707     1.3971  -6.063 1.34e-09 ***
## n.threats     1.8103     0.5485   3.300 0.000966 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Well, it says:

  1. on average, as time increases, abundances decline
  2. as the number of threats increases, the average population size decreases
  3. the interaction tells us that species with more threats decline faster through time than those with fewer threats
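Remember that, because of the log link, the conditional coefficients are on the log scale, so exponentiating them gives multiplicative effects on abundance:

##exponentiate the conditional fixed effects to get multiplicative effects
exp(fixef(Zi_nb1_mod)$cond)

For example, exp(-0.492) is roughly 0.61, so a one standard deviation increase in time multiplies the expected abundance by about 0.61 (a ~39% decline) for a species with an average number of threats.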

2.0.17 Plotting model outputs

Time - 15 minutes

There are a host of ways to plot model outputs, but this one is adapted from the glmmTMB paper appendix:

## make a blank data set which includes the variables in the model
## we will then use this to generate predicted values from the model for 
## various combinations of number of threats, standardised time, species,
## and population
## we can use the unique() function across the columns in our data.frame
## to retrieve every unique combination of:
##n.threats, standardised_time, species, and population
new_data <-  unique(species_single0[,c("n.threats",
                                       "standardised_time", 
                                       "species", 
                                        "population")])

##scale the relevant columns (remember our model is expecting scaled data)
new_data$n.threats<-scale(new_data$n.threats)
new_data$standardised_time<-scale(new_data$standardised_time)

##set the random effects of the model to zero
X_cond <- model.matrix(lme4::nobars(formula(Zi_nb1_mod)[-2]), new_data)
beta_cond <- fixef(Zi_nb1_mod)$cond
pred_cond <- X_cond %*% beta_cond
ziformula <- Zi_nb1_mod$modelInfo$allForm$ziformula
X_zi <- model.matrix(lme4::nobars(ziformula), new_data)
beta_zi <- fixef(Zi_nb1_mod)$zi
pred_zi <- X_zi %*% beta_zi

##transform point estimates of the unconditional counts to the response scale and multiply
##(because they are logged and on logit link)
pred_ucount = exp(pred_cond)*(1-plogis(pred_zi))

##load the MASS library
library(MASS)

##set the random number generator seed
set.seed(101)

##use posterior predictive simulations to generated upper and lower confidence intervals
## and median predicted counts
##ignoring variation in the random effects

##conditional
pred_condpar_psim = mvrnorm(1000,mu=beta_cond,Sigma=vcov(Zi_nb1_mod)$cond)
pred_cond_psim = X_cond %*% t(pred_condpar_psim)

##zero inflation parameter
pred_zipar_psim = mvrnorm(1000,mu=beta_zi,Sigma=vcov(Zi_nb1_mod)$zi)
pred_zi_psim = X_zi %*% t(pred_zipar_psim)

##transform them
pred_ucount_psim = exp(pred_cond_psim)*(1-plogis(pred_zi_psim))

##calculate 95% CIs
ci_ucount = t(apply(pred_ucount_psim,1,quantile,c(0.025,0.975)))
ci_ucount = data.frame(ci_ucount)

##rename
names(ci_ucount) = c("ucount_low","ucount_high")

##put into a data frame
pred_ucount = data.frame(new_data, 
                         pred_ucount, 
                         ci_ucount)

##we need to reverse the scaling of our predictor variables for our plots to make sense
##the scale() function stores attributes of the scaling in the vectors of scaled data 
## try running new_data$n.threats and looking at the bottom values
##write a function to do this:
unscale <- function(x){
  x * attr(x, 'scaled:scale') + attr(x, 'scaled:center')
}

##unscale the variables
pred_ucount$n.threats_unscaled <- unscale(pred_ucount$n.threats)
pred_ucount$standardised_time_unscaled <- unscale(pred_ucount$standardised_time)

##load the viridis package (colourblind friendly palletes)
library(viridis)

##plot out the predicted median values for abundance
## in response to time (x-axis)
##and grouped by the number of threats
ggplot(pred_ucount, aes(x = standardised_time_unscaled, 
                        y = pred_ucount, 
                        group = n.threats_unscaled, 
                        col = n.threats_unscaled))+ 
  ##median lines for each number of threats
  geom_line() +
  ##add in a geom_ribbon to show the 95% CI
  geom_ribbon(aes(ymin = ucount_low,
                  ymax = ucount_high), 
              alpha = 0.1, 
              col = "grey", 
              linetype=0) +
  ##minimal theme
  theme_minimal() +
  ##set x and y axes labels
  ylab("Predicted\nabundance") + xlab("Time\n(weeks)") +
  ##viridis colour pallette for continuous data
  scale_colour_viridis_c() +
  ##move legend to the top
  theme(legend.position = "top") +
  ##rename the legend
  labs(colour="Number of threats")

Brilliant! We have a clear visualisation of the effects of multiple threats on the dynamics of populations through time, and we can say with some certainty that the more stressors a population faces, the faster it is likely to be declining.

3 Key Concepts

Let’s quickly summarise what we have learned today:

  • What is a GLMM
  • How to fit a GLMM
  • How to decide on nested/crossed random effects
  • How to account for temporal autocorrelations
  • How to visualise residuals (difference between predicted and observed)
  • How to tell if a model is fitting our data well
  • How to explore alternative models
  • How to compare models
  • How to interpret model summaries
  • How to plot the results of GLMMs

Phew! That’s a lot to cover in one week. Don’t worry if you didn’t manage to work through everything, or if you found some parts difficult - you can always come back to this and refer to it at any time.

4 Functions covered today

| Function/operator | Call |
|---|---|
| Apply smoothing visualisation to ggplot() | geom_smooth() |
| Calculate difference between dates | difftime() |
| Run a GLM model | glm() |
| Return predicted values from a model | predict() |
| Return the residuals of a GLM | resid() |
| Make a QQ plot | qqnorm() |
| Add a QQ line to a QQ plot | qqline() |
| Compare models using AIC | AIC() |
| Summarise a model | summary() |
| Order the items in a column | order() |
| Fit a GLMM with TMB | glmmTMB() |
| Create scaled residuals from a GLMM | simulateResiduals() |
| Apply a function to grouped data in the tidyverse | group_map() |
| Fit a distribution to observations | fitdist() |
| Generate goodness-of-fit statistics from a fitted distribution | gofstat() |
| Calculate pseudo R-squared for GLMMs | r.squaredGLMM() |
| Update a model with a new argument | update() |
| Compare models and give more detail than AIC() | anova() |
| Test for significant outliers in simulateResiduals() output | testOutliers() |
| Test for significant dispersion in simulateResiduals() output | testDispersion() |
| Test for significant zero inflation in simulateResiduals() output | testZeroInflation() |
| Test for significant temporal autocorrelation in simulateResiduals() output | testTemporalAutocorrelation() |
| Return the fixed effects of a GLMM | fixef() |
| Compute the exponential | exp() |
| Plot a ribbon geom | geom_ribbon() |
| Apply a continuous viridis colour scheme | scale_colour_viridis_c() |

7 Homework

7.1 Homework solution

First of all, there isn't a right or a wrong way of doing last week's homework. Actually, this is usually the case with R - there are better ways of doing things (easier-to-read code, fewer chances to make mistakes, faster) and worse ways (the opposite of all of those things), but not necessarily a right and a wrong way. Your homework was about thinking outside the box, being creative, investigating the data, and trying to come up with an elegant visualisation. Here is one option, working from the long format of the data you generated last week:

7.1.1 Part 1 - Tokyo

##load packages
library(tidyverse)
library(vroom)
library(wbstats)
library(countrycode)

##read in data
tokyo <- vroom("https://raw.githubusercontent.com/chrit88/Bioinformatics_data/master/Workshop%205/Tokyo%202021%20medals.csv")

##add in position number
tokyo$Position <- 1:nrow(tokyo)

##pull GDP data
GDP <- wb_data(indicator = "NY.GDP.MKTP.CD", 
        start_date = 2019, 
        end_date = 2019)

tokyo$code <- countrycode(tokyo$Country, 
                          origin = "country.name", 
                          destination = "iso3c")

##manually correct the code for China
tokyo$code[2] <-  "CHN"

## join data
tokyo <- left_join(tokyo, 
                   GDP %>% select(iso3c, NY.GDP.MKTP.CD),
                         by = c("code" = "iso3c"))

## rename gdp to make it easier to work with
names(tokyo)[7] <- "GDP"

##visualise the data
ggplot(tokyo, aes(x=GDP, y=Position)) + 
  geom_point() + 
  theme_bw()

##look at a plot where the data are logged
ggplot(tokyo, aes(x=GDP, y=Position)) + 
  geom_point() + 
  scale_y_continuous(trans='log10') + 
  scale_x_continuous(trans='log10') + 
  theme_bw() +
  ggtitle("Logged data")


## fit a model to the raw (unlogged) data:
mod1 <- glm(Position ~ GDP, data=tokyo)
plot(mod1)
## DOESN'T LOOK GOOD!

## fit a model where both the x and y are logged:
mod2 <- glm(log10(Position) ~ log10(GDP), data=tokyo)
plot(mod2)

##so, is there an effect?
summary(mod2)
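A note on interpreting the slope of mod2: because both axes are on the log10 scale, a ten-fold increase in GDP multiplies the predicted Position by 10^slope. A sketch with simulated data (the exponent -0.3 and all names here are invented; the real slope comes from your summary(mod2) output):

```r
## simulate a power-law relationship: Position proportional to GDP^-0.3
set.seed(1)
GDP <- 10^runif(100, min = 8, max = 13)
Position <- 10^(4 - 0.3 * log10(GDP) + rnorm(100, sd = 0.2))
toy <- data.frame(GDP, Position)

## same model structure as mod2:
toy_mod <- glm(log10(Position) ~ log10(GDP), data = toy)

## slope on the log10-log10 scale (should recover roughly -0.3):
slope <- coef(toy_mod)[["log10(GDP)"]]

## multiplicative change in predicted Position for a 10-fold increase in GDP:
10^slope
```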

7.1.2 Part 2 - Iris

##load the data
data(iris)

##visualize the data
iris1 <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) + 
  geom_point(aes(col = Species)) +
  theme_bw()

iris1

## add in some simple linear regressions to visualise possible effects
##specifying col = Species means you will plot a different LM for each species (as colour is grouped)
iris1 + geom_smooth(aes(col = Species), method="lm")
## `geom_smooth()` using formula 'y ~ x'

##so it looks like a positive effect, let's build a model.
##initially fit with gaussian (normal) as the data are non-integer and a priori we would probably expect 
##a normal distribution. Using * means we include both the single effects and the interactive effect
mod1 <- glm(Petal.Width ~ Petal.Length*Species, data = iris)
##set a 2x2 plot area, so we get a single panel with 4 plots:
par(mfrow = c(2, 2))
##qqplot looks a bit uneven
plot(mod1)

##try with some data transformations:
mod2 <- glm(log(Petal.Width) ~ Petal.Length*Species, data = iris)
##this has made things much worse!
plot(mod2)

##what about square root?
mod3 <- glm(sqrt(Petal.Width) ~ Petal.Length*Species, data = iris)
##this looks much better:
plot(mod3)

##we don't need to check the AIC as we don't have two models which fit equally well 

##look at the model summary:
summary(mod3)
## 
## Call:
## glm(formula = sqrt(Petal.Width) ~ Petal.Length * Species, data = iris)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.23961  -0.04583  -0.00825   0.04838   0.26267  
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     0.20974    0.10091   2.079  0.03943 *  
## Petal.Length                    0.18887    0.06855   2.755  0.00662 ** 
## Speciesversicolor               0.31788    0.14821   2.145  0.03365 *  
## Speciesvirginica                0.89496    0.15704   5.699 6.57e-08 ***
## Petal.Length:Speciesversicolor -0.04317    0.07308  -0.591  0.55565    
## Petal.Length:Speciesvirginica  -0.13206    0.07186  -1.838  0.06816 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.006943608)
## 
##     Null deviance: 24.42219  on 149  degrees of freedom
## Residual deviance:  0.99988  on 144  degrees of freedom
## AIC: -311.93
## 
## Number of Fisher Scoring iterations: 2

So what are our conclusions from this?

  1. there is a significant value for the intercept (so the intercept is significantly different from 0). This is pretty irrelevant
  2. there is a significant effect of Petal.Length on Petal.Width across all the species
  3. the intercepts for versicolor and virginica are significantly different from that of setosa (which, remember, is the intercept of the model)
  • you can see this clearly in the below graph
  4. there is no interactive effect of Petal.Length:Speciesversicolor or Petal.Length:Speciesvirginica
  • this means that the slope of the effect of Petal.Length on Petal.Width doesn't change between the different species
  • ALTHOUGH the effect is very close to significant for virginica, suggesting this might be an artifact of the fairly low number of data points
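These conclusions can also be checked programmatically rather than by eyeballing the printed table: coef(summary(...)) returns the coefficient matrix, which you can index by row and column name:

```r
## refit the final iris model (iris ships with R)
data(iris)
mod3 <- glm(sqrt(Petal.Width) ~ Petal.Length * Species, data = iris)

## pull out the coefficient table as a matrix
coefs <- coef(summary(mod3))

## significant main effect of Petal.Length (conclusion 2):
coefs["Petal.Length", "Pr(>|t|)"]

## non-significant interaction for versicolor (conclusion 4):
coefs["Petal.Length:Speciesversicolor", "Pr(>|t|)"]
```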

7.2 This week’s homework

So this week’s homework will build on our delve into fitting GLMs to data. In your cloned Bioinformatics_data repository in the Workshop 5 folder you will find four different data sets. You will need to analyse these and for each identify:

  1. what the error distribution of the data is
  2. which of the predictor (x) variables have a significant effect on the y variable

You can do these in a GLM framework as there aren’t any random effects in the data. You do not need to look for any interactions between the predictor variables, and in this case do not need to consider transforming your data. Try to identify the best model fits you can, but you won’t get a perfect fit!
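If you want a starting point for step 1, the fitdist()/gofstat() workflow from today is one option. Here is a sketch on simulated counts (the homework data will have different column names, and Poisson is just the distribution used to generate this toy example):

```r
##load the distribution-fitting package used in the workshop
library(fitdistrplus)

##simulated integer response, for illustration only
set.seed(7)
y <- rpois(200, lambda = 4)

##fit two candidate error distributions to the observations
fit_pois <- fitdist(y, "pois")
fit_norm <- fitdist(y, "norm")

##compare goodness-of-fit statistics (lower AIC suggests a better fit)
gofstat(list(fit_pois, fit_norm), fitnames = c("poisson", "normal"))
```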

As with last week’s homework, write up a short Rmarkdown document detailing what you have done and the steps you have taken.

Answers to be revealed at the start of next week!